Cluster Independent Annotation (CIA) is a classification method meant to help researchers in the cell annotation step of scRNA-seq experiments. Exploiting two functions, given cell type signatures as input, this classifier computes a signature score for each cell (_signaturescore) and (with different modalities) it compares the score values in order to assign a label to each single cell (_signature_basedclassification).
This tool provides several advantages:
It synthesizes the information of a whole signature expression in a single score value, skipping the tedious one by one inspection of marker genes taken from long differentially expressed genes (DEGs) lists, which furthermore may not be cluster specific when taken singularly.
It provides a classification for each cell that is completely independent from clustering, thus, it can be used in parallel with a clustering method in order to set a proper resolution value, obtaining so coherent and easy to annotate cell groups.
Given the implemented modalities, it can be very fast and classify a huge dataset (hundreds of thousands cell) in a bunch of seconds, or it can be statistically more reliable exploiting the comparison of obtained signature scores with randomic signature ones. Since the latter method is of course computationally heavier, we implemented the possibility to parallelize processes.
Being signature-based, this tool can provide information about any kind of gene list with a biological meaning, allowing also functional annotation.
Normalizing for the gene signature length, it enables the comparison of genesets with different length, spanning from tens to thousands genes (obviously with different significance).
CIA is composed by two main modules:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import multiprocessing
from functools import partial
from scipy.sparse import issparse
from scipy import sparse
import time
from sklearn import metrics
from scipy import sparse
import itertools
import matplotlib.pyplot as plt
from cia import investigate, report, utils
A representative workflow is shown in this tutorial, in which we exploited CIA functions to automatically annotate PBMC3K scRNA-seq datasets at cellular level, starting from a set of gene signatures extracted from a PBMC atlas obained with CITE-seq method.
Our method requires as input an AnnData object with a raw expression matrix (AnnData.raw.X) and a dictionary containing the names of the signatures (e.g. cell type, cell state ...) as keys and correspondent gene names as values. With the intent of standardize the analysis, we strongly suggest to use AnnData objects preprocessed following those Scanpy tutorial steps:
For demonstration purposes, we decided to use as signatures the DEGs of Hao et al., 2021 [1] PBMC atlas. This dataset clusters have been confidently annotated exploiting “weighted-nearest neighbor” framework, an integrative analysis which takes into account both RNA and protein level information. This approach ensures that the extracted gene lists (RNA-based only) are actually associated to a specific cell type.
# To download the dataset (2.536 GB)
!wget https://datasets.cellxgene.cziscience.com/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad -O data/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad
# to read the atlas data
atlas= sc.read('data/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad')
atlas
# to start from a count matrix
atlas.X = atlas.layers['corrected_counts']
atlas.X.todense()
# to normalize the data
sc.pp.normalize_total(atlas, target_sum=1e4)
sc.pp.log1p(atlas)
# to set AnnData.raw attribute
atlas.raw= atlas
This dataset has been annotated at 3 different levels of granularity. We focused on the coarser one, both for visualization purposes and to facilitate the comparisons with other datasets annotated at lower resolution.
sc.pl.umap(atlas, color='celltype.l1')
sc.pl.umap(atlas, color=['celltype.l2', 'celltype.l3'], wspace=1)
Since labels 'other' and 'other T' don't refer to any cell type in particular, we checked their identity in the higher resolution annotation.
ax=sc.pl.umap(atlas, show=False )
sc.pl.umap(atlas[(atlas.obs['celltype.l2']=='gdT') | (atlas.obs['celltype.l2']=='MAIT') |
(atlas.obs['celltype.l2']=='dnT') | (atlas.obs['celltype.l2']=='Platelet')], color='celltype.l2', ax=ax)
'other T' cluster is composed by double negative T cells, gamma delta T cells and MAIT cells. Those cells are not annotaded in PBMC3K, therefore there is no way to determine wether their eventual finding is correct or not. Since our intent is to use PBMC3K as test dataset, we removed 'other T' cells from the analysis. 'other' cluster is mainly composed by platelets and for this reason has been renamed 'Platelet'.
# to remove 'other T' cells
atlas=atlas[atlas.obs['celltype.l1']!='other T']
atlas.obs['Cell type']=atlas.obs['celltype.l1']
# to rename clusters
atlas.obs['Cell type']=atlas.obs['Cell type'].cat.rename_categories(['B', 'CD4 T', 'CD8 T', 'DC', 'Mono', 'NK', 'Platelet'])
atlas.uns['Cell type_colors']=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2']
atlas.obs['Cell type'].cat.categories
We performed differential expression analysis (DEA) on coarser resoluted clusters and we selected as DEGs genes having at least 1.5 log2FC, with minimal mean of expression 0.25, z-score 5 and expressed at least in 40% of cells within the cluster, in order to obtain shorter but more specific gene lists than a usual DEA.
atlas
sc.tl.rank_genes_groups(atlas, groupby='Cell type')
# to filter differentially express genes with cia.utils.filter_degs()
gmt=utils.filter_degs(atlas, groupby='Cell type', uns_key='rank_genes_groups', logFC=1.5, scores=5, perc=40,
mean=0.25, direction='up')
for i in gmt.keys():
print(i + ' signature has '+ str(len(gmt[i])) +' genes')
We then assessed the overlap among lists and, despite the stringent DEA, we found that CD8 T and CD4 T cells signatures show a similarity that is remarkably higher than similarities among all the other signatures (50% of DEGs shared with CD4 T). The authors themselves claimed that those populations are difficult to be effectively discriminated by scRNA-seq alone, in particular CD8 T and CD4 T cells. Citing them: "We found that for CD8+ T cells, the most similar RNA neighbors often reflected a mix of CD8+ and CD4+ T cells (in the RNA KNN graph, there are a total of 944 incorrect edges that connect CD8+ to CD4+ T cells). By contrast, protein neighbors were predominantly correctly identified as CD8+ T cells (in the protein KNN graph, 12 CD8+/CD4+ edges were identified)" [1]. For this reason we can't consider those signatures completely specific and so we don't expect to obtain an ideal classification for those cell types.
# to check gene lists similarity with cia.utils.signatures_similarity()
utils.signatures_similarity(gmt, show='J')
Since signature refinement is not the intent of this package and of this tutorial, and it should be done before the annotation/classification step, we proceeded using the above described signatures to show the usages of the proposed functions, and to report the automatic annotation performances.
In order to evaluate both the consistency of our method and the performances of classification, we used the PMBC atlas DEGs to automatically annotate the PBMC3K dataset from Satija et al., 2015 [3]. This dataset was annotated by the authors relying on clustering and marker genes expression inspection and it is widely used as reference in the scientific community. We classified this dataset independently from the already present annotation, whose cell labels were used as ground truth to evaluate our classification perfomances with different modalities.
# to load the test dataset
pbmc3k=sc.read('data/pbmc3k.h5ad')
# to normalize and logarithmize the values
sc.pp.normalize_total(pbmc3k, target_sum=1e4)
sc.pp.log1p(pbmc3k)
# to set the .raw attribute
pbmc3k.raw=pbmc3k
# to compute HVG
sc.pp.highly_variable_genes(pbmc3k, min_mean=0.0125, max_mean=3, min_disp=0.5)
# to subset features
pbmc3k=pbmc3k[:, pbmc3k.var.highly_variable]
pbmc3k
pbmc3k.uns['Cell type_colors']=['#b2df8a', '#fb9a99', '#1f78b4', '#E4D00A', '#6fadfd', '#d62728','#FF5733']
For the classification of the test dataset we renamed and merged some clusters in order to make easier the comparison and the visualization of results. In particular, ‘CD14+ Monocytes' and 'FCGR3A+ Monocytes' clusters were merged into ‘Mono’.
sc.pl.umap(pbmc3k, color=['louvain','Cell type'], wspace=0.6)
signature_score function is based on "gene signature score" calculation method shown in Della Chiara, Gervasoni, Fakiola, Godano et al., 2021 [2]. The "gene signature score" of a signature S in a cell C is defined as follows:
${\bf{GSS(C)}}=\frac{{n}}{L}\frac{{\sum }^n_{i=1}{\bf{X}}i\: for\:{\bf{G}}i \:in \:S}{{\sum }^m_{j=1}{\bf{X}}j\: for\:{\bf{G}}j \:in \:G}$
Where:
With this score it is possible to condensate in a single value both the proportion of expressed signature genes and their overall expression, enabling researchers to easily study whole signatures expression at single cell level.
The raw score is the very same score described in the above mentioned paper, it's the default score_mode in signature_score function.
The function both supports signature dictionaries and canonical gmt files (tab separated and without header), provided as filepath or url.
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='raw')
investigate.signature_score(data=atlas, signatures_dict='./data/atlas.gmt', score_mode='raw')
Signature scores are stored by default in AnnData.obs, adding a column for each signature, named accordingly to the signature name.
atlas.obs[gmt.keys()]
Alternatively, signature scores can be directly returned as an array by specifying return_array=True.
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='raw', return_array=True)
Scaled score is the raw score divided by max score value, operation which rescales the values from 0 to 1. Scaled scores of different signatures, with different length, can thus be directly compared. To compute the 'scaled' score, score_mode must be set as 'scaled'.
To assess the capability of extracted signatures to discriminate cell types, we compared scaled score distributions.
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='scaled')
sc.pl.umap(atlas, color='Cell type')
By inspecting the score values, for all the signatures, the highest values are found in the proper cluster, indicating the sensitivity of the signatures and the capability of the signature score to represent the expression of the whole gene lists.
sc.pl.umap(atlas, color=gmt.keys(), color_map='Reds')
CD4 T and CD8 T scores, as expected, show an overlap reflecting their similarity. In particular both signatures are highly expressed in CD8 Naive subcluster.
ax=sc.pl.umap(atlas, show=False )
sc.pl.umap(atlas[(atlas.obs['celltype.l2']=='CD8 Naive')], color='celltype.l2', ax=ax)
Inspecting the violin plots showing the distribution of score values, as expected, it seems that there is a cluster in which values are higher than all the others for each signature.
sc.pl.violin(atlas, keys=gmt.keys(), groupby='Cell type')
To better viusualize those distributions, we exploited grouped_distributions. By selecting AnnData.obs columns containing signature scores, this function plots a heatmap showing the medians of their values in cell groups (each of the above shown violin plot information is condensed in a heatmap column) and it prints a statistical report. For each cell group, a two-sided Wilcoxon test is perfomed to evaluate if the distribution with the highest median is different from the others. For each signature, a two-sided Mann-Whitney U test is performed to evaluate if the distribution in the cell group having the highest median is different from the other groups distributions.
report.grouped_distributions(atlas, groups_obs='Cell type', columns_obs=gmt.keys(), scale_medians='column-wise', cmap='Reds')
The statistical tests confirmed that the visible differences in signature score distributions are significant, indicating that scaled signature scores are consistent with authors annotation. With the evidence of the goodness of the signatures, we proceeded with the classification of PBMC3K.
To classify the test dataset we used signature_based_classification. Two main classification modalities have been implemented: a fast classification mode, in which scaled scores are directly computed and compared, and a FC score-based mode that, relying on the comparison between raw signature scores and random signature scores, can provide a more confident classification at the cost of computation time.
report.grouped_distributions(atlas, columns_obs=list(gmt.keys()), groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
Fast classification is performed by assigning to each cell the label of the signature with the max scaled score value. Being based on matrices and vectors operations, this computation is very fast.
investigate.signature_based_classification(data=pbmc3k, signatures_dict='./data/atlas.gmt',
fast_mode=True, obs_name='Prediction fast mode')
sc.pl.umap(pbmc3k, color=gmt, cmap='Reds', save='_scaled_scores.pdf')
report.grouped_distributions(pbmc3k, groups_obs='Cell type', columns_obs=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], scale_medians='column-wise', cmap='Reds')
The signature score distribution of Dendritic Cells (DC) and Monocytes (Mono) were not significantly different within the manually annotated dendritic cells cluster (Wilcoxon test). However, since for each distribution, there is only a cell group in which values are significantly higher with respect to all the other groups (Mann-Whitney U test), the simple inspection of PBMC3K UMAP showing the different signature scores is sufficient to highlight the sought cell populations. This confirmed the consistency of the signatures across datasets and suggested that the scaled score could be per se sufficient to help researchers in an eventual annotation, permitting to skip the one-by-one evaluation of DEGs reported in literature as marker genes.
By inspecting and comparing the ground truth and the predicted labels is evident that the vast majority of the cells was correctly classified.
pbmc3k.uns['Prediction fast mode_colors']= ['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A','#FF5733']
sc.pl.umap(pbmc3k, color=['Cell type','Prediction fast mode'], wspace=0.5)
report.group_composition(pbmc3k, classification_obs='Prediction fast mode', groups_obs='Cell type',
columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], cmap='Greens')
FC score-based classification is performed by looking for the max Fold Change score value for each cell instead of scaled score one. FC score is computed by dividing the raw signatue score by the median of raw scores of a given number (n_iter) of random signatures. To generate random signatures, all the genes of the dataset are ranked by their mean expression and are assigned to a given bin, and for each signature gene a random gene of the same expression bin is picked.
Standard mode is performed with default parameters: all the FC scores lower than 1 (random score < signature score) are turned into 0, and if a cell has a FC score equal to 0 for all the signature, it will not be assigned to any cell type. This approach is slower with respect to fast mode and it could be affected by a decrease of performances due to unassigned cells, but in this way the classified cells are labelled with an higher confidence and cells with low FCs (e.g. not fully differentiated) are not forced to be classified. To speed-up the process it's possible to parallelize the computations allocating jobs to a given number of processors (n_proc). By setting new_score='FC', raw scores in AnnData.obs can be replaced by FC values.
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
n_iter=500, n_proc=32,new_score='filtered', obs_name='Prediction standard mode', FC_threshold= 1, n_bins=25)
Beside the possibility to substitute raw score values in AnnData.obs, by default both FC scores and filtered FC scores are stored in AnnData.uns.
pbmc3k.uns["signature_based_classification"]
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df
filtered_FC_pbmc3k=['CD4 T_filtered','B_filtered','CD8 T_filtered', 'NK_filtered','Mono_filtered','DC_filtered','Platelet_filtered']
By inspecting the distributions of filtered FC scores, it is even more evident that each signature is preferentially expressed by one cell group.
report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
Indeed, the majority of each cell group has been correctly classified.
pbmc3k.uns['Prediction standard mode_colors']= ['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
'#FF5733', '#762a83']
sc.pl.umap(pbmc3k,color=['Cell type', 'Prediction standard mode'], wspace=0.5)
report.group_composition(pbmc3k, classification_obs='Prediction standard mode', groups_obs='Cell type',
columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], cmap='Greens')
No major differences have been detected between fast and standard classification in both datasets, suggesting a per se validity of scaled score to perform a fast explorative cell type prediction.
Quantile-based filter adds another layer of stringency to the classification by acting on FC distributions. By setting q equal to a number from 0 to 1, all the FC score values below that given quantile are set 0. In this way only the highest values of FC are mainteined and so only the most signature-expressing cells will be classified. This could be very useful in case of bimodal distribution of score values or when a signature is moderately expressed in all the clusters.
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
n_iter=500, n_proc=32, obs_name='Prediction q', q=0.50, n_bins=25)
pbmc3k.uns['Prediction q_colors']=['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
'#FF5733', '#762a83']
sc.pl.umap(pbmc3k,color=['Cell type', 'Prediction q'], wspace=0.5)
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df
report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
report.group_composition(pbmc3k, classification_obs='Prediction q', groups_obs='Cell type',
columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet', 'Unassigned'], cmap='Greens')
Again, no major differences have been found, indicating the consistency of the signatures. As expected, the clusters with the highest variability are CD8 T and CD4 T, which, as mentioned before, have aspecific signatures.
The most stringent level of classification is given by p-val mode. As mentioned before, FC score is obtained by dividing raw signature scores for the median of scores of n_iter * randomic signatures. For each signature, for each cell, a p-value is calculated by counting how many times random signature values are higher than gene signature one (all divided by n_inter). If p value is higher than the set p parameter, the FC score of that cell for that signature will become 0. In this way only the significantly signature-expressing cells will be assigned.
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
n_iter=500, n_proc=32, obs_name='Prediction p-val', p=0.05, n_bins=25, new_score='FC')
pbmc3k.uns['signature_based_classification']
pbmc3k.uns['Prediction p-val_colors']=['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
'#FF5733', '#762a83']
sc.pl.umap(pbmc3k, color=['Cell type','Prediction p-val'], wspace=0.5)
sc.pl.umap(pbmc3k, color=filtered_FC_pbmc3k, cmap='Reds')
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df
report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
plt.savefig('figures/heatmap_pval.pdf')
report.group_composition(pbmc3k, classification_obs='Prediction p-val', groups_obs='Cell type',
columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet', 'Unassigned'], cmap='Greens')
Again, no major differences were found with the statistically relevant p value mode, suggesting the consistency of our classification method.
To evaluate classification performances both per ground truth cluster and overall we exploited respectively grouped_classification_metrics and classification_metrics functions.
In both functions cell labels assigned by CIA and the annotation already present in test datasets are compared in order to count true positive (TP), true negative (TN), false positive (FP) and false negative (FN) cells for each cluster. Only for the overall calculation the per-cluster counts are summed to obtain the total TN, TP, FN and FP.
Then, again for both functions, the following metrics are calculated:
N.B.: the column of the classification of interest and the one with ground truth labels must have the same categories to be compared.
Since FC-based classification exploits random signatures generation classification results may slightly differ each time the kernel is restarted. For this reason we saved the labels of the classification in 'data/pbmc3k_classified.h5ad'. From now on, for reproducibility pourposes, we use those results to calculate performances.
pbmc3k=sc.read('data/pbmc3k_classified.h5ad')
Here, for clarity, we show only the per-cluster classification metrics of the statistically relevant p value-based method for both datasets.
report.grouped_classification_metrics(pbmc3k, classification_obs='Prediction p-val',groups_obs='Cell type')
And here are reported the overall perfomances of each classification modality:
report.classification_metrics(pbmc3k,
classification_obs=['Prediction fast mode', 'Prediction standard mode', 'Prediction q', 'Prediction p-val'],
groups_obs='Cell type')
Signature score is confirmed to be a powerful way to condensate the information of a whole gene signature expression at single cell level. The only differenceres from the expected score distributions were attributed to the high transcriptional similarity (reported by the authors themselves) of some PBMC atlas cell group (from which we extracted the signatures), which have been previously defined also exploiting protein level information. This further suggested the capability of this score to clearly highlight the goodness of the transcriptional signature itself.
Signature based classification resulted to be effective in the automatic annotation of scRNA-seq datasets with external signatures. In particular, considering the overall performances of CIA in the PBMC3K test dataset:
In our package, besides the classification tool, we also implemented a module of functions which allows to easily compare classification methods and evaluate score distributions in cell groups (also obtained with other packages).
☛ To visualize how CIA performs with respect to other classifiers compatible with Scanpy, see CIA benchmarking tutorial.